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MODEL-BASED CLUSTERING OF LARGE NETWORKS 

By Duy Quang Vu and David R. Hunter and Michael Schweinberger 

We describe a network clustering framework, based on finite mix- 
ture models, that can be applied to discrete-valued networks with 
hundreds of thousands of nodes and billions of edge variables. Rela- 
tive to other recent model-based clustering work for networks, we in- 
troduce a more flexible modeling framework, improve the variational- 
approximation estimation algorithm, discuss and implement standard 
error estimation via a parametric bootstrap approach, and apply 
these methods to much larger datasets than those seen elsewhere 
in the literature. The more flexible modeling framework is achieved 
through introducing novel parameterizations of the model, giving 
varying degrees of parsimony, using exponential family models whose 
structure may be exploited in various theoretical and algorithmic 
ways. The algorithms, which we show how to adapt to the more com- 

■ plicated optimization requirements introduced by the constraints im- 
' posed by the novel parameterizations we propose, are based on vari- 
ational generalized EM algorithms, where the E-steps are augmented 
by a minorization-maximization (MM) idea. The bootstrapped stan- 
dard error estimates are based on an efficient Monte Carlo network 
simulation idea. Last, we demonstrate the usefulness of the model- 
based clustering framework by applying it to a discrete-valued net- 
work with more than 131,000 nodes and 17 billion edge variables. 
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1. Introduction. According to Fisher (1922, p. 311), "the object of statistical methods is 
the reduction of data." The reduction of data is imperative in the case of discrete-valued net- 

■ works that may have hundreds of thousands of nodes and billions of edge variables. The collec- 
tion of such large networks is becoming more and more common, thanks to electronic devices 
such as cameras and computers. Of special interest is the identification of influential subsets of 
nodes and high-density regions of the network with an eye to break down the large network 
into smaller, more manageable components. These smaller, more manageable components may 
be studied by more advanced statistical models, such as advanced exponential family models (e.g., 
Frank and Strauss, 1986; Strauss and Ikeda, 1990; Wasserman and Pattison, 1996; Snijders et al., 
2006; Hunter and Handcock, 2006). 

An example is given by signed networks, such as trust networks, which arise in World Wide 
Web applications. Users of internet-based exchange networks are invited to classify other users as 
either —1 (untrustworthy) or +1 (trustworthy). Trust networks can be used to protect users and 
enhance collaboration among users (Kunegis et al., 2009; Massa and Avesani, 2007a). A second 
example is the spread of infectious disease through populations by way of contacts among individuals 
(Britton and O'Neill, 2002; Groendyke et al., 2011). In such applications, it may be of interest to 
identify potential super-spreaders — i.e., individuals who are in contact with many other individuals 
and who could therefore spread the disease to many others — and dense regions of the network 
through which disease could spread rapidly. 

The current article advances the model-based clustering of large networks in at least four ways. 
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First, we introduce a simple and flexible statistical framework for parameterizing models based on 
statistical exponential families (e.g., Barndorff-Nielsen, 1978) that advances existing model-based 
clustering techniques. Model-based clustering of networks was pioneered by Snijders and Nowicki 
(1997). The simple, unconstrained parameterizations employed by Snijders and Nowicki (1997) and 
others (e.g., Nowicki and Snijders, 2001; Airoldi et al., 2008; Daudin et al., 2008; Zanghi et al., 
2010; Mariadassou et al., 2010) make sense when networks are small, undirected, and binary, and 
when there are no covariates. In general, though, such parameterizations may be unappealing from 
both a scientific point of view and a statistical point of view, as they may result in non-parsimonious 
models with hundreds or thousands of parameters. An important advantage of the statistical frame- 
work we introduce here is that it gives researchers a choice: They can choose interesting features of 
the data, specify a model capturing those features, and cluster nodes based on the specified model. 
The resulting models are therefore both parsimonious and scientifically interesting. 

Second, we introduce approximate maximum likelihood estimates of parameters based on novel 
variational generalized EM (GEM) algorithms, which take advantage of minorization-maximization 
(MM) algorithms (Hunter and Lange, 2004) and have computational advantages. In the presence 
of parameter constraints, we facilitate computations by exploiting the properties of exponential 
families (e.g., Barndorff-Nielsen, 1978). In applications to sparse networks, the running time of 
the variational GEM algorithm is 0(n), whereas conventional variational EM algorithms (e.g., 
Airoldi et al., 2008; Daudin et al., 2008; Zanghi et al., 2010; Mariadassou et al., 2010) have running 
time 0(n 2 ). In addition, we sketch how the variational GEM algorithm can be extended to obtain 
approximate Bayesian estimates. 

Third, we introduce bootstrap standard errors to quantify the uncertainty about the approximate 
maximum likelihood estimates of the parameters, whereas other work has ignored the uncertainty 
about the approximate maximum likelihood estimates. To facilitate these bootstrap procedures, we 
introduce Monte Carlo simulation algorithms that generate sparse networks in much less time than 
conventional Monte Carlo simulation algorithms. In fact, without the more efficient Monte Carlo 
simulation algorithms, obtaining bootstrap standard errors would be infeasible. 

Finally, while model-based clustering has been limited to fewer than 13,000 nodes and 85 million 
edge variables (see the largest data set handled to date, Zanghi et al., 2010), we demonstrate that 
we can handle directed, non-binary networks with more than 131,000 nodes and 17 billion edge 
variables. 

The paper is structured as follows: Model-based clustering based on finite mixture models is 
introduced in Section 2. Approximate maximum likelihood and Bayesian estimation are discussed 
in Sections 3 and 4, respectively, and an algorithm for Monte Carlo simulation of large networks 
is described in Section 5. The application to the discrete-valued network with more than 131,000 
nodes and 17 billion edges variables is presented in Section 6. 

2. Models for large, discrete-valued networks. We consider n nodes, indexed by integers 
1, . . . , n, and edges y^ between pairs of nodes i and j, where y^ can take values in a finite set of 
M elements. By convention, ya = for all i, where signifies "no relationship." We consider the 
set of all edges y^ to be a discrete-valued network, which we denote by y, and we let y denote the 
set of possible values of y. Special cases of interest are (a) undirected binary networks y, where 
yij S {0, 1} is subject to the linear constraint yij = yji for all i < j; (b) directed binary networks y, 
where y^ € {0, 1} for all and (c) directed signed networks y, where y^ € {—1, 0, 1} for all 

A general approach to modeling discrete-valued networks is based on exponential families of 
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distributions (Besag, 1974; Frank and Strauss, 1986): 

(2.1) P e {Y = y | x) = exp[e T g(x, y) - ^(0)], y € % 

where is the vector of canonical parameters and g(x,y) is the vector of canonical statistics 
depending on a matrix x of covariates, measured on the nodes or the pairs of nodes, and the 
network y, and ip{0) is given by 

(2.2) = log exp[e r g(x,y%ee R p 

and ensures that Pg(Y = y\x) sums to 1. 

A number of exponential family models have been proposed (e.g., Holland and Leinhardt, 1981; 
Frank and Strauss, 1986; Wasserman and Pattison, 1996; Snijders et al., 2006; Hunter and Handcock, 
2006). In general, though, exponential family models are not scalable: the computing time to eval- 
uate the likelihood function is exp(iV log M), where N = n{n — l)/2 in the case of undirected edges 
and N = n{n — 1) in the case of directed edges, which necessitates time-consuming estimation 
algorithms (e.g., Snijders, 2002; Hunter and Handcock, 2006; M0ller et al., 2006; Koskinen et al., 
2010; Caimo and Friel, 2011). 

We therefore restrict attention to scalable exponential family models, which are characterized by 
dyadic independence: 

ii 

(2.3) P g (y = y\x) = Y[P (D ij = d ij \x), 

i<j 

where Dij = DijiY) corresponds to Yy in the case of undirected edges and (Yij,Yji) in the case of 
directed edges. 

Dyadic independence has at least three advantages: (a) it facilitates estimation, because the 
computing time to evaluate the likelihood function scales linearly with iV; (b) it facilitates simula- 
tion, because dyads are independent; and (c) by design it bypasses the so-called model degeneracy 
problem: if N is large, some exponential family models without dyadic independence tend to be 
ill-defined and impractical for modeling networks (Strauss, 1986; Handcock, 2003; Schweinberger, 
2011). 

A disadvantage is that most exponential families with dyadic independence are either simplistic 
(e.g., models with identically distributed edges, Erdos and Renyi, 1959; Gilbert, 1959) or non- 
parsimonious (e.g., the pi model with 0{n) parameters, Holland and Leinhardt, 1981). 

We therefore assume that the probability mass function has a if-component mixture form as 
follows: 

P^ e (Y = y\x) = Y, P o( Y = y\x,Z = z) P^(Z = z) 

(2.4) 

= J2HPe(D l] = d ij \x,Z = z)P 1 (Z = z), 

zez i<j 

where Z denotes the membership indicators Z±,..., Z n with distributions 



(2.5) 



Zi\li,-- ■ ,7ic~Multinomial(l;7i, . . .,jk) 
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and Z denotes the support of Z. In some applications, it may be desired to model the membership 
indicators Z\ as functions of x by using multinomial logit or probit models with Zi as the outcome 
variables and x as predictors (e.g., Tallberg, 2005). We do not elaborate on such models here, but 
the variational GEM algorithms discussed in Sections 3 and 4 could be adapted to such models. 

Mixture models represent a reasonable compromise between model parsimony, model complexity, 
and computational complexity. In particular, the assumption of conditional dyadic independence 
does not imply marginal dyadic independence, which means that the mixture model of Equa- 
tion (2.4) captures some degree of dependence among the dyads. We give two specific examples of 
mixture models below. 

Example 1. The p\ model of Holland and Leinhardt (1981) for directed, binary- valued networks 
may be modified using a mixture model. The original p\ models the sequence of in-degrees (num- 
ber of incoming edges of nodes) and out-degrees (number of outgoing edges of nodes) as well as 
reciprociated edges, postulating that the dyads are independent and that the dyadic probabilities 
are of the form 

(2.6) Pe(Dij = dij) = exp [(a; + /3j) y i3 + (ay + ft) yji + pyijVji - ipij(0)} , 

where = (a%, . . . , a n , (3i, . . . , /3 n , p) and exp{— tpij(0)} is a normalizing constant. A drawback of 
this model is that it requires 2n + l parameters. Here, we show how to extend it to a mixture model 
that is applicable to both directed and undirected networks as well as discrete- valued networks, 
that is much more parsimonious, and that allows identification of influential nodes who dominate 
the network. 

Observe that the dyadic probabilities of Equation (2.6) are of the form 

(2.7) Pe(Dij = d^) cx exp[0j gi (d tj ) + 6»J g 2 {d ij ) + 0j g 2 (dij)]- 

A mixture model modification of the p\ model postulates that conditional on Z, the dyadic prob- 
abilities are independent and of the form 

(2.8) P e (Dij = d^ | Z ik = Zji = 1) cx exp[0j gi (dij) + Oj k g 2 (dij) + 6>J g 2 (d ij )}, 

where the parameter vectors 9 2 k and 2 i depend on the components k and I to which the nodes 
i and j belong, respectively. The mixture model version of the p\ model is therefore much more 
parsimonious provided K <C n and was proposed by Schweinberger et al. (2011) in the case of 
undirected, binary-valued networks. Here, the probabilities of Equations (2.7) and (2.8) are appli- 
cable to both undirected and directed networks as well as discrete-valued networks, because the 
functions g\ and g 2 may be customized to fit the situation and may even depend on covariates x, 
though we have suppressed this possibility in the notation. Finally, the mixture model version of 
the p\ model admits model-based clustering of nodes based on indegrees or outdegrees or both. A 
small number of nodes with high indegree or outdegree or both is considered to be influential: If 
the corresponding nodes were to be removed, the network structure would be impacted. 

Example 2. The mixture model of Nowicki and Snijders (2001) assumes that, conditional on Z, 
the dyads are independent and the conditional dyadic probabilities are of the form 



(2.9) 



Pn(Dij = d\Z ik = Zji = 1) = Kd;kl- 
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In other words, conditional on Z, the dyad probabilities are constant across dyads and do not depend 
on covariates. It is straightforward to add covariates by writing the conditional dyad probabilities 
in canonical form: 



where the canonical statistic vectors gi(x,dij) and g2(x,dij) may depend on the covariates x. If 
the canonical parameter vectors Oki are constrained by the linear constraints Oki = Ok + Oi, where 
Ok and 0[ are parameter vectors of the same dimension as Oki, then the mixture model version of 
the pi model arises. In other words, the mixture model version of the p\ model can be viewed as a 
constrained version of the Nowicki and Snijders (2001) model. While the constrained version can be 
used to cluster nodes based on degree, the unconstrained version can be used to identify, for instance, 
high-density regions of the network, corresponding to subsets of nodes with large numbers of within- 
subset edges. These regions may then be studied individually in more detail by using more advanced 
statistical models such as exponential family models without dyadic independence as proposed by, 
for example, Holland and Leinhardt (1981), Frank and Strauss (1986), Strauss and Ikeda (1990), 
Wasserman and Pattison (1996), Snijders et al. (2006), or Hunter and Handcock (2006). 

Other examples. Other mixture models for networks have been proposed by Tallberg (2005), 
Handcock et al. (2007), and Airoldi et al. (2008). However, these models scale less well to large 
networks, so we confine attention here to Examples 1 and 2. 

3. Approximate maximum likelihood estimation. A standard approach to maximum 
likelihood estimation of finite mixture models is based on the classical EM algorithm, taking the 
complete data to be (Y,Z), where Z is unobserved (Dempster et al., 1977). However, the E-step 
of an EM algorithm requires the computation of the conditional expectation of the complete data 
log likelihood function under the distribution of Z \Y, which is intractable even in the simplest 
possible cases (Daudin et al., 2008). 

As an alternative, we consider so-called variational EM algorithms, which can be considered to 
be generalizations of EM algorithms. The basic idea of variational EM algorithms is to construct a 
tractable lower bound on the intractable log likelihood function and maximize the tractable lower 
bound, which gives rise to approximate maximum likelihood estimates. In recent work, Celisse et al. 
(2011) have shown that approximate maximum likelihood estimators along these lines are — at least 
in the absence of parameter constraints — consistent estimators. 

We assume that all modeling of Y can be conditional on covariates x and define 



However, for ease of presentation, we drop the notational dependence of nd;ij,ki,n on x an< ^ ma ke 
the homogeneity assumption 



which is satisfied by the models in Examples 1 and 2. Exponential parameterizations of iTd;kl{Q)i 
as in Equations (2.6) and (2.10), may or may not be convenient. An attractive property of the 
variational EM algorithm proposed here is that it can handle all possible parameterizations of 
Tfd;ki(Q)- I n some cases (e.g., Example 1), exponential parameterizations are more advantageous 
than others, while in other cases (e.g., Example 2), the reverse holds. 



(2.10) 



Pe(Dij = dij | x, Z ik = Zji = 1) oc exp [Oj gt(x, dij) + 0j t g 2 (x, dy)] 



Kd;ij,uA e ) = p e{Dij = d\Z ik = = l,x). 
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3.1. Variational EM algorithm. Let A(z) = P(Z = z) be an auxiliary distribution with support 
Z. Using Jensen's inequality, the log likelihood function can be bounded below as follows: 



logP J>0 (Y = y) 



y,z 



A(z) 



-A(z) 



(3.2) 



> 



E 

zgZ 



log 



P ly0 {Y = y,Z = z) 
A(z) 



A(z) 



E A [logP lfi {Y = y,Z = z)]- E A [logA(Z)}. 



Some choices of A{z) give rise to better lower bounds than others. To see which choice gives rise to 
the best lower bound, observe that the difference between the log likelihood function and the lower 
bound is equal to the Kullback-Leibler divergence from A(z) to P lt0 (Z = z\Y = y): 



E 

z£Z 



log 



(3.3) 



log P lt6 (Y = y) 

Y J [^gP J AY = y)]A(z)-Y J 

A(z) 



P J>0 (Y = y,Z = z) 



A(z) 



zez 



zdZ 



A{z) 

P Jt0 (Y = y,Z = z) 



E 

z£Z 



log 



P lt0 (Z = z\Y = y) 



log ' 
\{z). 



A(z) 



A(z) 



If the choice of A(z) were unconstrained in the sense that we could choose from the set of all 
distributions with support Z, then the best lower bound is obtained by the choice A{z) = P lt0 (Z = 
z | Y = y), which reduces the Kullback-Leibler divergence to and makes the lower bound tight. If 
the optimal choice is intractable, as is the case here, then it is convenient to constrain the choice to 
a subset of tractable choices and substitute a choice which, within the subset of tractable choices, is 
as close as possible to the optimal choice in terms of Kullback-Leibler divergence. A natural subset 
of tractable choices is given by introducing the auxiliary parameters a = (cki, . . . ,q„) and setting 



(3.4) 



A(z) 



P n (Z 



f[P<*i(Zi 



i=l 



where the marginal auxiliary distributions P ai (Z{ = z{) are Multinomial (1; an, . . . ,chk)- In this 
case, the lower bound may be written 



LB ML (-y,0;a) 



(3.5) 



E a [logP^ (Y = y,Z = z)}- E a [logP a (Z)} 

n K K n K 

^2^2^2 a ikaji log 7^.^(0) + ^2^2a ik (log7 fe -loga ifc ) 

i<j k=l 1=1 i=l k=l 



Because Equation (3.4) assumes independence, the Kullback-Leibler divergence between P a (Z = z) 
and P-y^ {Z = z\Y = y), and thus the tightness of the lower bound, is determined by the depen- 
dence of the random variables Z\, . . . , Z n conditional on Y . If the random variables Z\, . . . , Z n 
are independent conditional on Y, then, for each i, there exists on such that P ai (Zi = Zi) = 
P~y i0 (Zi = Zi \Y = y), which reduces the Kullback-Leibler divergence to and makes the lower 
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bound tight. In general, the random variables Z\, . . . , Z n are not independent conditional on Y 
and the Kullback-Leibler divergence (3.3) is thus positive. 

Approximate maximum likelihood estimates of 7 and 9 can be obtained by maximizing the 
lower bound in (3.2) using variational EM algorithms of the following form, where t is the iteration 
number: 

E-step: Letting and 9^ denote the current values of 7 and 9, maximize LBMiif^, # , <*) 
with respect to a. Let o;(* +1 ) = at'+^^W, flw) denote the optimal value of a and compute 
E a(t+l) [logP^ e (Y = y,Z = z)}. 

M-STEP: Maximize E a ( t +i) [log P 1 ^{Y = y,Z = z)\ with respect to 7 and 9, which is equivalent 
to maximizing LBml(i>9', oc( t+1 ^) with respect to 7 and 9. 

The method ensures that the lower bound is non-decreasing in the iteration number: 

(3.6) L% i ( 7 (<) ,0 W ;Q (t) ) < L5 A/L (7 W ,0 W ;a (m) ) 

(3.7) < LS Mi ( 7 (m) ,0 (m) ;a (m) ), 

where inequalities (3.6) and (3.7) follow from the E-step and M-step, respectively. 

It is instructive to compare the variational EM algorithm to the classical EM algorithm as applied 
to finite mixture models. The E-step of the variational EM algorithm minimizes the Kullback-Leibler 
divergence between A(z) and P_,(t) g(t){Z = z \ Y = y). If the choice of A(z) were unconstrained, 
then the optimal choice would be A(z) = P„(t) g{t) (Z = z \ Y = y). Therefore, in the unconstrained 
case, the E-step of the variational EM algorithm reduces to the E-step of the classical EM algorithm, 
so the classical EM algorithm can be considered to be the optimal variational EM algorithm. 

3.1.1. Generalized E-step: An MM algorithm. To implement the E-step, we exploit the fact that 
the lower bound is non-decreasing as long as the E-step and M-step increase the lower bound. In 
other words, we do not need to maximize the lower bound in the E-step and M-step. Indeed, in- 
creasing rather than maximizing the lower bound in the E-step and M-step may have computational 
advantages when n is large. A variational EM algorithm that increases rather than maximizes the 
lower bound resembles a generalized EM (Dempster et al., 1977) and may therefore be called a 
variational generalized EM, or variational GEM, algorithm. 

Direct maximization of LBml(i^\ # j a) is computationally unattractive: Equation (3.5) shows 
that the lower bound depends on the products ctji and therefore fixed-point updates of along 
the lines of Daudin et al. (2008) would depend on all other a,-;. Although we did not undertake a 
comprehensive comparison of algorithms using full maximization at each E-step via the fixed-point 
updates with our variational GEM idea, anecdotal evidence suggests that for smaller networks 
(with a few thousand nodes), the fixed-point approach works well, often faster in terms of total 
computing time than GEM. However, for large networks such as the ePinions dataset of Section 6, 
the advantage appears to go to GEM, with the fixed-point method failing to finish in the eight-day 
window allotted on the computing cluster we used and using the algorithmic settings we used, with 
a maximum of 100 fixed-point iterations for every E-step. 

To separate the parameters of the maximization problem, we increase LBml(i^\ 1 ol) via an 
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MM algorithm (Hunter and Lange, 2004) by introducing the surrogate function 



(3.8) 




V Va ifc [ log 7® - log affj 




which we show in Appendix A to have the following two properties: 

(3.9) Q A/L ( 7 (i) ,0 W ,<*W;a) < LB ML {~i® , 0® ; a) for all a, 

(3.10) QML{l {t \e^\ aW;aW) = LS ML ( 7 W , « W )- 

In the language of MM algorithms, conditions (3.9) and (3.10) establish that Qjwx(7 \ # j ct^; ct) 
is a minorizer of LBml(i^\ j at) at a^. The theory of MM algorithms implies that maximiz- 
ing the minorizer with respect to a forces LBml(i^\9^ uphill (Hunter and Lange, 2004). 
This maximization, involving nK separate univariate quadratic functions under the constraints 
Y2k=i a ik = 1 f° r each i, may be accomplished quickly using the method described by Stefanov 
(2004). In particular, it is much easier to update a by maximizing the Qml function, which com- 
pletely separates the a parameters into the sum of functions of the individual ctik, than by max- 
imizing the LB ml function when n is large. We therefore arrive at the following replacement for 
the E-step: 

Generalized E-step: For i = 1, . . . , n, increase Qml(j^ > 0®, ot®; a) as a function of on subject 
to X^fcLi a ik = 1- Let af denote the new value of a. 

3.1.2. More on the M-step. To maximize LBn.jL(^f,9', a ^ t+1 ^) in the M-step, examination of 
Equation (3.5) shows that maximization with respect to 7 and 6 may be accomplished separately. 
In fact, for 7, there is a simple, closed-form solution: 

(3.11) it +i) = 1 -±ar\k=i,..., K . 

Concerning 6, if there are no constraints on 7r(<?) other than J^eD 7Vd-ki(9) = 1, it is preferable to 
maximize with respect to n = tt(6) rather than 6, because there are closed-form expressions for 
7j-(* +1 ) but not for 6^ t+l \ Maximization with respect to ir is accomplished by setting 

n 

(t+l) (i+1) 



(3.12) vr^ = ^ , d €LD, k,l = 1,. . . ,K. 

i<j 

Remark 1. If the homogeneity assumption (3.1) does not hold, then closed- form expressions for 
7r may not be available. In some cases, e.g., in the presence of categorical covariates, closed form 
expressions for n are available, but the dimension of 7r, and thus computing time, increases with 
the number of categories. 
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Remark 2. If Equations (2.1) and (2.3) hold, then the exponential parametrization ir(0) may be 
inverted to obtain an approximate maximum likelihood estimate of 6 after the approximate MLE of 
7r is found using the variational GEM algorithm. One method for accomplishing this inversion ex- 
ploits the convex duality of exponential families (Barndorff-Nielsen, 1978; Wainwright and Jordan, 
2008) and is explained in Appendix B. 

Remark 3. If, in addition to the constraint XldeD 7r d;fez(^) = 1 ? additional constraints on 7r are 
present, the maximization with respect to it may either decrease or increase computing time. 
Linear constraints on it can be enforced by Lagrange multipliers and reduce the dimension of 7r 
and thus computing time. Non-linear constraints on 7T, as in Example 1, may not admit closed 
form updates of 7r and thus may require iterative methods. If so, and if the non-linear constraints 
stem from exponential family parameterizations of tt(0) with natural parameter vector 9 as in 
Example 1, it is convenient to translate the constrained maximization problem into an uncon- 
strained maximization problem by maximizing LBML(l,d;a^ t+1 ^) with respect to and exploit 
the fact that LB M l{i, ; a (t+1) ) is a concave function of 6 owing to the exponential family mem- 
bership of 7rd;fcz(0) (Barndorff-Nielsen, 1978, p. 150). We show in Appendix C how the exponential 
family parameterization can be used to derive the gradient and Hessian of the lower bound of 
LBml(i ',0; a( t+1 ^) with respect to 6, which we exploit in Section 6 using a Newton-Raphson 
algorithm. 

3.2. Standard errors. Although we maximize the lower bound LBml^^^cx) of the log like- 
lihood function to obtain approximate maximum likelihood estimates, standard errors of the ap- 
proximate maximum likelihood estimates ■y and 9 based on the curvature of the lower bound 
LBml{i,6] a ) may be too small. The reason is that even when the lower bound is close to the 
log likelihood function, the lower bound may be more curved that the log likelihood function 
(Wang and Titterington, 2005); indeed, the higher curvature helps ensure that LBml(i,0] a ) is a 
lower bound of the log likelihood function log P^^iY = y) in the first place. As an alternative, we 
approximate the standard errors of the approximate maximum likelihood estimates of -f and by 
a parametric bootstrap method (Efron, 1979) that can be described as follows: 

(1) Given the approximate maximum likelihood estimates of 7 and 6, sample N data sets. 

(2) For each data set, compute the approximate maximum likelihood estimates of 7 and 9. 

In addition to fast maximum likelihood algorithms, the parametric bootstrap method requires fast 
simulation algorithms. We propose such an algorithm in Section 5. 

3.3. Starting and Stopping. As usual with EM-like algorithms, it is a good idea to use multiple 
different starting values with the variational EM due to the existence of distinct local maxima. We 
find it easiest to use random starts in which we assign the values of and then commence with 
an M-step. This results in values and 0^°\ then the algorithm continues with the first E-step, 
and so on. The initial af^ are chosen independently uniformly randomly on (0, 1), then each af^ 

is multiplied by a normalizing constant chosen so that the elements of sum to one for every i. 

The numerical experiments of Section 6 used 100 random restarts each. Ideally, more restarts 
would be used, yet the size of the datasets with which we work makes every run somewhat expensive. 
We chose the number 100 because we were able to parallelize on a fairly large scale, essentially 
running 100 separate copies of the algorithm. Larger numbers of runs, such as 1000, would have 
forced longer run times since we would have had to run some of the trials in series rather than in 
parallel. 
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As a convergence criterion, we stop the algorithm as soon as 

\LBml (7 (m) , (m) ; <* (m) ) - LB ml ( 7 (t) , {t) ; Q (t) ; 
LB ML (-y( t + 1 ), ;q(*+ 1 ) ) 



< 10 



-10 



The convergence criterion is based on the relative change of the objective function rather than the 
absolute change, because (1) even small changes in the parameter values can result in large changes 
of the objective function, and (2) the objective function is a lower bound of the log likelihood, and 
small absolute changes of the objective function may not be worth the computational effort. 

4. Approximate Bayesian estimation. The key to Bayesian model estimation and model 
selection is the marginal likelihood, defined as 



(4.1) 



P(Y = y) 



PiA Y = V,Z = z) p(j, 6) d 7 d 0, 



where p{~y,6) is the prior distribution of 7 and 6. To ensure that the marginal likelihood is well- 
defined, we assume that the prior distribution is proper, which is common practice in mixture 
modeling (McLachlan and Peel, 2000, Chapter 4). A lower bound on the log marginal likelihood 
can be derived by introducing an auxiliary distribution with support Z x T x 0, where T is the 
parameter space of 7 and is the parameter space of 8. A natural choice of auxiliary distributions 
is given by 



(4.2) 



A a (z,<y,0) 



.4 = 1 



Pa-Si) 



(«Z,1, • • • , (*z,r 



where a denotes the set of auxiliary parameters corresponding to cxz 
and otg. 

A lower bound on the log marginal likelihood can be derived by Jensen's inequality: 



(4.3) 



log P(Y = y) = log 



r Je 



E 



P^e(Y = y,Z = z)p( 1 ,9) 



A a (z,7,0)d 7 d0 



> E a [logP^ e (Y = y,Z = z)p( 7 ,0)] - E a \iogA a (Z,j,e)], 



where the expectations are taken with respect to the auxiliary distribution A a (z,y, 0). 

We denote the right-hand side of (4.3) by LBs(ct-y, cte; olz)- By an argument along the lines of 
(3.3), one can show that the difference between the log marginal likelihood and LBb(q.j, otg; olz) is 
equal to the Kullback-Leibler divergence from the auxiliary distribution A a (z, 7, 0) to the posterior 
distribution P(Z = z, 7, \ Y = y): 



(4.4) 



\ogP{Y = y) 

z&Z L 



E 



log 

A a (zn,o) 



P{Z = z, 1 ,6\Y = y) 



P^o(Y = y ,z = z) P ( 1 ,ey 

A a (z,j,G) 
A a (z,y,0)d>yae. 



^(2,7,0) d 7 d0 



The Kullback-Leibler divergence between the auxiliary distribution and the posterior distribution 
can be minimized by a variational GEM algorithm as follows, where t is the iteration number: 
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Generalized E-step: Letting aty and 0$ denote the current values of 07 and ag, increase 



, (Xq ; az) with respect to az- Let at z ~ tl ' ) denote the new value of az- 



Generalized M-step: Choose new values a^ +1 ^ and Qg I+i; that increase LBs(a^,ag; a ( z 
with respect to 07 and ag. 

By construction, iteration t of a variational GEM algorithm increases the lower bound LBsia^, ag; az) 



.,(*+!) 



(4.5) 
(4.6) 



rn (At) At).Aty 
LBB\a 1 , a e , a z t 



< LB B {a%\af-,a%^) 

< LB B {a { ' +l \a { a t+l) -a% +l) ) 



A variational GEM algorithm addresses two problems at the same time: It approximates the 
marginal likelihood as well as the posterior distribution. Therefore, it tackles Bayesian model esti- 
mation and model selection at the same time. 

Variational GEM algorithms for approximate Bayesian inference are only slightly more compli- 
cated to implement than the variational GEM algorithms for approximate maximum likelihood 
estimation presented in Section 3. To understand the difference, we examine the analogue of Equa- 
tion (3.5): 



K K 



(4.7) 



LBB(a-y,ag; az) 



= ^^^2ct Z ,ikOtz,jlE a [logir dij . k i(0)] + E a [logP y (Z = z)] 

i<j k=i i=i 

+ E a [log p( 7 , 0)]-E a [log A(Z = z, 7, 0)} . 



If the prior distributions of 7 and are given by independent Dirichlet and Gaussian distributions 
and the auxiliary distributions of Zi,...,Z n , 7, and are given by independent Multinomial, 
Dirichlet, and Gaussian distributions, respectively, then the expectations on the right-hand side 
of (4.7) are tractable, with the possible exception of the expectations E a \log iTd\ki(Q)]- Whether 
the expectations are tractable depends on the parameterization of ir^kii®)- Under the exponential 
parameterization 



(4.8) n d . k i 
the expectations can be written as 



(6) = exp J 6 T g(d) - log £ exp [e T g(d')] 1 , 



(4.9) 



E a [\ogir d , k i{0)} = E a [0] T g{d) - E a \ log ^ ex P \ 0T 9{d' 



d'eD 



and are intractable. We are not aware of parameterizations under which the expectations are 
tractable. We therefore use exponential parameterizations and deal with the intractable nature 
of the resulting expectations by invoking Jensen's inequality: 



(4.10) 



E a [logir d . kl (0)] > E a [6] T g{d) - log £ E a {exp [o T g(d')} } . 



d'ev 



The expectations on the right-hand side of (4.10) are expectations of independent log- normal ran- 
dom variables, which are tractable. We may obtain a looser, yet tractable, lower bound by replacing 
the intractable expectation E a [logn d - k i(9)] in Equation (4.7) by the right side of Equation (4.10). 
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To save space, we do not address the specific numerical techniques that may be used to implement 
the variational GEM algorithm here. In short, the generalized E-step is based on an MM algorithm 
along the lines of Section 3.1.1. In the generalized M-step, numerical gradient-based methods may 
be used. A detailed treatment of this Bayesian estimation method, using a more complicated prior 
distribution, may be found in Schweinberger et al. (2011). 

5. Monte Carlo simulation. Monte Carlo simulation of large, discrete- valued networks serves 
at least three purposes: 

(a) to generate simulated data to be used in simulation studies; 

(b) to approximate standard errors of the approximate maximum likelihood estimates by para- 
metric bootstrap; 

(c) to assess model goodness-of-fit by simulation. 

A crude Monte Carlo approach is based on sampling Z by cycling through all n nodes and sampling 
Dij | Z by cycling through all n(n — l)/2 dyads. However, the running time of such an algorithm is 
0(n 2 ), which is too slow to be useful in practice, because each of these goals listed above tends to 
requires numerous simulated datasets. 

We propose Monte Carlo simulation algorithms that exploit the intrinsic sparsity of discrete- 
valued networks. Discrete-valued networks tend to be sparse in the sense that one element of D 
dominates all other elements of T>. An example is given by directed, binary-valued networks, where 
T> = {(0,0), (0, 1), (1,0), (1, 1)} is the sample space of dyads and (0,0) G T> tends to dominate all 
other elements of D. 

Assume there exists an element b of D, called the baseline, that dominates the other elements 
of D in the sense that tt^m S> 1 — ^b;ki for all j and k. The Monte Carlo simulation algorithm 
exploiting the sparsity of large, discrete-valued networks can be described as follows: 

(1) Sample Z by sampling M ~ Multinomial(ra; 71, . . . , jk) and assigning nodes l,...,Mx to 
component 1, nodes Mi + 1, . . . , Mi + M2 to component 2, etc. 

(2) Sample Y \ Z as follows: For each 1 < k < I < K , 

(a) generate Ski ~ Binomial(A r M, 1 — n^ki), where Nki is the number of pairs of nodes 
belonging to components k and I; 

(b) sample S^i pairs of nodes i < j with replacement from among the Nf-i by rejection 
sampling; 

(c) for each i < j belonging to the pair of components k and /, sample according to the 
probabilities ir d - k i/(l - 7r 6; w)> deD, d^b. 

6. Application. We demonstrate the simplicity and flexibility of the model-based clustering 
framework introduced above by applying it to a data set with more than 131,000 nodes and 17 billion 
edge variables. We consider both constrained and unconstrained network model-based clustering 
and compare them to traditional model-based clustering. 

The data were collected by maintainers of the website epinions . com, which allows users to review 
goods and services. Massa and Avesani (2007b) collected data on n = 131,827 users. Readers of the 
reviews at epinions . com face uncertainty because the number of reviewers is large and almost all 
reviewers are unknown; it is not clear which reviewers users should trust. It is, therefore, desirable 
to obtain an indication of users' trustworthiness based on evaluations by other users. The website 
collects such data by allowing any user i to evaluate any other user j as either untrustworthy, coded 
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as yij = —1, or trustworthy, coded as yij = +1. Most values of yij will be zero, indicating that i 
has not evaluated j. The resulting network includes N = n(n — 1), or more than 17 billion, edge 
variables. 

To address the problem of trustworthiness, we employ the model-based clustering framework 
introduced above by allowing cluster membership to determine the value of the parameter corre- 
sponding to the "excess" statistics 

n 

ei(y) = ^2,Vjh 

equal to the number of positive ratings received by user i in excess of the number of negative 
ratings, a natural measure of this user's individual trustworthiness. A model that captures individual 
trustworthiness along with reciprocity is given by 



P e {Dij = dij | Z ik = Z jt = 1) oc exp 0- yT + 9+ y± + 0" yj. + 9 + y± 

(6-1) 

+o£v» + ofvij + f > + • 

where y~j = I(yij = —1) and y~t = I(yij = 1) are indicators of negative and positive edges, 
respectively. The parameters in model (6.1) are not identifiable, because y^ = yf- — y~- and y^ = 
yf. — y~ We therefore constrain the positive edge parameter + to be 0. Exploiting the flexibility 
afforded by this modeling framework, model (6.1) assumes in the interest of model parsimony that 
the propensities to form negative and positive edges and to reciprocate negative and positive edges 
do not vary across categories. 

We assume that the number of categories is five, a choice motivated by the fact that many 
internet-based companies and websites, such as amazon.com, google.com, and netflix.com, let 
reviewers award 1 to 5 stars to products, services, and service providers. 

For the sake of comparison, we also consider two alternative model-based clustering methods for 
the same dataset. One is a univariate method based on the assumption that the individual excesses 
are independently sampled from a mixture of normal densities. Traditional univariate approaches 
like this are less suitable than network-based clustering approaches for at least two reasons. First, 
the individual excesses are not independent, because the individual excesses are functions of edges 
and edges may be dependent owing to reciprocity (and other forms of dependence, not modeled 
here), which decades of research (e.g., Davis, 1968; Holland and Leinhardt, 1981) have shown to be 
important in shaping social networks. Traditional model-based clustering methods, by construction, 
assume that the individual excesses are independent and are therefore less suitable than network 
model-based clustering methods, which are capable of taking dependence into account. Second, 
the topology of networks may be of interest, in which case an approach that ignores this topology 
is less suitable than network-based clustering methods; in particular, the normal-mixture-model 
parameter estimates do not reveal anything about the network structure, whereas the parameters 
in our network model are directly interpretable. The five-component mixture-of-normal model that 
we consider here assumes that the individual excesses ei(y) are independent random variables 
sampled from a distribution with density 

(6.2) f{x) = j2\~* (X ^ 
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(a) Log likelihood lower bound 



(b) Cluster-specific excess parameters 



Fig 1. (a) Trace plot of the lower bound LBml(7''', 0'*' ; a'*') of the log likelihood function and (b) cluster- specific 
excess parameters 6^ , using 100 runs with random starting values 



where Xj, fj,j, and Cj are component-specific mixing proportions, means, and standard deviations, 
respectively, and <j){-) is the standard normal density. 

The other alternative clustering method we consider is the unconstrained network model of 
Equation (2.9). With five components, this model comprises four mixing parameters Ai, . . . , A4 in 
addition to the itd-,ki parameters, of which there are 105: There are nine types of dyads d whenever 
k 7^ I, contributing 8(2) =80 parameters, and six types of dyads d whenever k = I, contributing 
an additional 5(5) = 25. 

We used a variational GEM algorithm to estimate the network model (6.1), where the M-step was 
executed by a Newton-Raphson algorithm using the gradient and Hessian derived in Appendix C 
with a maximum of 100 iterations. It stopped earlier if the largest absolute value of the gradient 
vector was less than 10~ 10 . By contrast, the unconstrained network model (2.9) employs a vari- 
ational GEM algorithm using the exact M-step update (3.12). The variational GEM algorithm 
stopped when either the relative change in the objective function was less than 10~ 10 or 6000 itera- 
tions were performed. For the unconstrained model, most runs required the full 6000 iterations. To 
estimate the normal mixture model (6.2), we used the R package mixtools (Benaglia et al., 2009). 

To diagnose convergence of the algorithm for fitting model (6.1), we present the trace plot of the 
lower bound of the log likelihood function LBml(i^\ j at^') in Figure 1(a) and the trace plot of 
the cluster-specific excess parameters 6^ in Figure 1(b). Both figures are based on 100 runs, where 
the starting values were obtained by the procedure described in Section 3.3. The results suggest that 
all 100 runs seem to converge to roughly the same solution. This fact is somewhat remarkable, since 
many variational algorithms appear very sensitive to their starting values, converging to multiple 
distinct local optima (e.g., Daudin et al., 2010; Salter- Townshend and Murphy, 2010). For instance, 
the 100 runs for the unconstrained network model (2.9) produced essentially a unique set of values 
for each set of random starting values. Similarly, the normal mixture model algorithm produced 
many different local maxima, even after we tried to correct for label-switching by choosing random 
starting values fairly tightly clustered by their mean values. 

Figure 2 shows the observed excesses ei(y), . . . , e n (y) grouped by clusters for the best solutions, 
as measured by likelihood or approximate likelihood, found for each clustering method. It appears 
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(a) Network mixture model (6.1) 
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(b) Normal mixture model (6.2) 
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(c) Unconstrained network mixture model (2.9) 



Fig 2. Observed values of excess trust ei(y), grouped by highest-probability component of i, for (a) parsimonious 
network mixture model (6.1) with 8 parameters, (b) normal mixture model (6.2) with H parameters, and (c) uncon- 
strained network mixture model (2.9) with 109 parameters. 



that the clustering based on the parsimonious network model is superior to the clusterings for 
the other two models, which are similar to each other. In addition, if we use a normal mixture 
model in which the variances are restricted to be constant across components, the results are even 
worse, with one large cluster and multiple clusters with few nodes. In Figure 3, we "ground truth" 
the clustering solutions using external information: The average rating of each article, categorized 
by the most likely category assigned to its author. Again, the parsimonious 8-parameter network 
mixture model seems to be superior to both the normal mixture model and the unconstrained 109- 
parameter network mixture model; in fact, the latter does not even preserve the correct ordering 
of the categories by median average rating. 

The numerical results of our estimation algorithm are shown in Table 1. The 95% confidence 
intervals reported in that table are obtained by simulating 500 networks using the method of 
Section 5 and the parameter estimates obtained via our algorithm. For each network, we then ran 
our algorithm for 1000 iterations starting at the M-step, where the a parameters were initialized 
to reflect the "true" component to which each node was assigned by the simulation algorithm by 
setting ajfc = 10~ 10 for k not equal to the true component and = 1 — 4 x 10~ 10 otherwise. This 
was done to eliminate the so-called label-switching problem, which is rooted in the invariance of 
the likelihood function to switching the labels of the 5 components and which can affect bootstrap 
samples in the same way it can affect Markov chain Monte Carlo samples from the posterior of 
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(c) Unconstrained network mixture model (2.9) 

Fig 3. Average ratings of 659,290 articles, grouped according to the highest-probability category of the article's author, 
for (a) parsimonious network mixture model (6.1) with 8 parameters, (b) normal mixture model (6.2) with 14 param- 
eters, and (c) unconstrained network mixture model (2.9) with 109 parameters. The ordering of the five categories, 
which is the same as in Figure 2, indicates that the unconstrained network mixture model does not even preserve the 
correct ordering of the median average ratings. 

finite mixture models (Stephens, 2000). The sample 2.5% and 97.5% quantiles form the confidence 
intervals shown. In addition, we give density estimates of the five trustworthiness bootstrap samples 
in Figure 4. 

While we are encouraged by the fact that bootstrapping is at all feasible for problems of this 
size, there are aspects of our investigation that will need to be addressed with further research. 
First, the bootstrapping is so time-consuming that we were forced to rely on computing clusters 
with multiple computing nodes to generate a bootstrap sample in reasonable time. Future work 
could focus on more efficient bootstrapping. Some work on efficient bootstrapping was done by 
Kleiner et al. (2011), but it is restricted to simple models and not applicable here. 

Second, when the variational GEM algorithm is initialized at random locations, it may converge 
to local maxima whose LBml(*Y, 0\ ol) values are inferior to the solutions attained when the algo- 
rithm is initialized at the "true" values used to simulate the networks. While it is not surprising 
that variational GEM algorithms converge to local maxima, it is surprising that the issue shows up 
in some of the simulated data sets but not in the observed data set. One possible explanation is that 
the structure of the observed data set is clear-cut, but that the components of the estimated model 
are not sufficiently separated. Therefore, the estimated model may place non-negligible probability 
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-15.212 
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Table 1 

95% Confidence intervals based on parametric bootstrap using 500 simulated networks, with 1000 iterations for each 
network. The statistic Ei e i(y) Zik equals Ei Ej^i Vi* Zik, where Ziu — 1 if user i is a member of cluster k and 

Zik = otherwise. 




-0.02 -0.01 0.00 0.01 0.02 



Fig 4. Kernel density estimates of the five bootstrap samples of the trustworthiness parameters, shifted so that each 
component's estimated parameter value (shown in the legend) equals zero. 

mass on networks where two or more subsets of nodes are hard to distinguish and the variational 
GEM algorithm may be attracted to local maxima. 

Third, some groups of confidence intervals, such as the first four trustworthiness parameter 
intervals, have more or less the same width. We do not have a fully satisfying explanation for this, 
but note that, for a given partition of the set of nodes, the number of units carrying information 
about the component-specific excess parameter depends on the size of component as well as the sizes 
of all other components, and the width of the confidence intervals may depend on the magnitude 
of the component-specific excess parameter. 

In summary, we find that the clustering framework we introduce here provides useful results for a 
very large network. Most importantly, the sensible application of statistical modeling ideas, which 
reduces the unconstrained 109-parameter model to a constrained 8-parameter model, produces 
vastly superior results in terms of interpretability, numerical stability, and predictive performance. 

7. Discussion. The model-based clustering framework outlined here represents several ad- 
vances. An attention to standard statistical modeling ideas relevant in the network context im- 
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proves model parsimony and interpretability relative to fully unconstrained clustering models while 
also suggesting a viable method for assessing precision of estimates obtained. Algorithmically, our 
advances allow us to apply a variational EM idea, recently applied to network clustering models in 
numerous publications (e.g., Nowicki and Snijders, 2001; Airoldi et al., 2008; Daudin et al., 2008; 
Zanghi et al., 2010; Mariadassou et al., 2010), to networks far larger than any that have been con- 
sidered to date. We have applied our methods to networks with over a hundred thousand nodes 
and signed edges, indicating how they extend to categorical- valued edges generally or models that 
incorporate other covariate information. In practice, these methods could have myriad uses, from 
identifying high-density regions of large networks to selecting among competing models for a single 
network to testing specific network effects of scientific interest when clustering is present. 

To achieve these advances, we have focused exclusively on models exhibiting dyadic independence 
conditional on the cluster memberships of nodes. It is important to remember that these models 
are not dyadic independence models overall, since the clustering itself introduces dependence. How- 
ever, to more fully capture network effects such as transitivity, more complicated models may be 
needed, such as the latent space models of Hoff et al. (2002), Schweinberger and Snijders (2003), 
or Handcock et al. (2007). A major drawback of latent space models is that they tend to be less 
scalable than the models considered here. An example is given by the variational Bayesian algo- 
rithm developed by Salter- Townshend and Murphy (2010) to estimate the latent space model of 
Handcock et al. (2007). The running time of the algorithm is 0(n 2 ) and it has therefore not been 
applied to networks with more than n = 300 nodes and N = 89,700 edge variables. In contrast, 
the running time of the variational GEM algorithm proposed here is 0(n) in the constrained and 
0(f(n)) in the unconstrained version of the Nowicki and Snijders (2001) model, where f(n) is the 
number of edge variables whose value is not equal to the baseline value. It is worth noting that 
f(n) is 0(n) in the case of sparse graphs and therefore the running time of the variational GEM 
algorithm is 0(n) in the case of sparse graphs. Indeed, even in the presence of the covariates (which 
are not considered by Salter- Townshend and Murphy, 2010), the running time of the variational 
GEM algorithm is 0{C 2 n) provided the covariates are categorical with C categories. We have 
demonstrated that the variational GEM algorithm can be applied to networks with more than n = 
131,000 nodes and N = 17 billion edge variables. 

While the running time of 0{n) shows that the variational GEM algorithm scales well with 
n, in practice, the "G" in "GEM" is an important contributor to the speed of the variational 
GEM algorithm: Merely increasing the lower bound using an MM algorithm rather than actually 
maximizing it using a fixed-point algorithm along the lines of Daudin et al. (2008) appears to save 
much computing time for large networks, though an exhaustive comparison of these two methods 
is a topic for further investigation. 

An additional increase in speed might be gained by exploiting acceleration methods such as 
quasi-Newton methods (Press et al., 2002, Section 10.7), which have shown promise in the case of 
MM algorithms (Hunter and Lange, 2004) and might accelerate the MM algorithm in the E-step of 
the variational GEM algorithm. However, application of these method is complicated in the current 
modeling framework because of the exceptionally large number of auxiliary parameters introduced 
by the variational augmentation. 

We have neglected here the problem of selecting the number of clusters. Daudin et al. (2008) 
propose making this selection based on the so-called ICL criterion, but it is not known how the 
ICL criterion behaves when the intractable incomplete-data log likelihood function in the ICL 
criterion is replaced by a variational-method lower bound. In our experience, the magnitude of the 
changes in the maximum lower bound value achieved with multiple random starting parameters 
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at least as large as the magnitude of the penalization imposed on the log-likelihood by the ICL 
criterion; thus, we were unsuccessful in obtaining reliable ICL-based results for very large networks. 
More investigation of this question, and of the selection of the number of clusters general, seems 
warranted. 

By demonstrating that scientifically interesting clustering models can be applied to very large 
networks by extending the variational-method ideas developed for network datasets recently in the 
statistical literature, we hope to encourage further investigation of the possibilities of these and 
related clustering methods. 
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APPENDIX A: OBTAINING A MINORIZER OF THE LOWER BOUND 
The lower bound LBml(i^]cx) of the log likelihood function can be written as 

n K K n K 

(A.l) LB M l{i,0;ol) = EEE 

i<j fc=l 1=1 i=l k=l 

Since log vr^. ; / c /(0) < for all 6, the arithmetic-geometric mean inequality (Hunter and Lange, 
2004) implies that 

(A.2) u ik otji log iT dlj -ki(0) > ( a 2 ik^j- + <%ii£r- ) !og ^d iy ,u( e ) 

with equality if = aik and ctji = otji. In addition, using the concave nature of the logarithm, 

(A. 3) - log a ik > - log a ik - ^ + 1 

with equality if a^ = aik- Therefore, function Qml(i, 6, <*) as defined in (3.8) possesses prop- 
erties (3.9) and (3.10). 

APPENDIX B: CONVEX DUALITY OF EXPONENTIAL FAMILIES 

We show how closed form expressions of 9 in terms of tv can be obtained by exploiting the convex 
duality of exponential families. Let 

(B.l) W) = supjflV-V^)) 

be the Legendre-Fenchel transform of ip(0), where fi = fJ.(0) = Eg[g(Y)] is the mean-value param- 
eter vector and the subscripts k and I have been dropped. By Barndorff-Nielsen (1978, p. 140) and 
Wainwright and Jordan (2008, pp. 67-68), the Legendre-Fenchel transform of ip(6) is self-inverse 
and thus ip(0) can be written as 

(B.2) 1>(0) = supjflV-^V)} = sup{flV(w)-V*(M*-))}. 

where n(n) = X^deD 9(d) and iI)*(ii(tt)) = J^de'D ^d^ogTTd- Therefore, closed-form expressions of 
in terms of 7r may be found by maximizing T fi>(ir) — ip*(n(ir)) with respect to 7r. 
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APPENDIX C: GRADIENT AND HESSIAN OF LOWER BOUND 

We are interested in the gradient and Hessian of the lower bound LBml{i i@'> a ) °f t ne log 
likelihood function with respect to parameter vector 9. 
The lower bound LBml{7-> d;a) can be written as 

n K K n K 

(C.l) LB ML (7,0;a) = EEE 

a ik acji log 71-^.^(0) + ^2 ^2 a ik (log 7fc - log a ik ) . 

i<j k=l 1=1 i=l k=l 

The two examples of models considered in Section 2 assume that the conditional dyad probabilities 
take the form 

(C2) TdyjfcK*) = exp[ Vkl (0) T g(d l:j )-iP kl (e)}, 

where rjki(0) = A^iO is a linear function of parameter vector 9 and A k i is a matrix of suitable 
order depending on components k and I. It is convenient to absorb the matrix A^i into the statistic 
vector g{dij) and write 

(c.3) Kd&uiP) = eM^gU^-^mi 

where = Aj l g(dij). Thus LBml{i,6',o<) can be written as 

n K K 

(C.4) LB ML ( 7 ,e;cx) = EEE 

a ik atji 9 T gl l (d ij ) - ipki(Q) + const, 

i<j k=i i=i 

where "const" denotes terms which do not depend on 6 and 
(C.5) ^ki{0) = log^exp[0 T <^(d)]. 

Since the lower bound LBml{1i Q\ a ) 1S a weighted sum of exponential family log-probabilities, 
it is straightforward to obtain the gradient and Hessian of LBml^, 6] <*) with respect to 0, which 
are given by 

n K K 

(C.6) V,L5 A/L (7,0;<*) = Y.Y.Y. a ^ a 3l i9Udij) ~ MgUDio)]} 

i<j k=i i=i 

and 

n K K 

(C.7) VlLB M L(~f,0;a) = -EEE^^^^^hW' 

i<j k=i i=i 

respectively. 

In other words, the gradient and Hessian of LBml(i, 0\ a ) with respect to 6 are weighted sums of 
expectations — the means, variances, and covariances of statistics. Since the sample space of dyads 
T> is finite and, more often than not, small, these expectations may be computed by complete 
enumeration of all possible values of d £ D and their probabilities. 
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